Subset and manipulate data ************************** Subset to specific samples ########################## .. code-block:: python subset.samples(obj, var='index', slist='None', keep0=False): Returns an object subsetted to the selected samples. *obj* is the object generated with files.load *var* is the column heading in the meta data used for subsetting, if var=’index’ the actual sample names will be used *slist* is a list of samples or meta data labels to keep if *keep0* =True, OTUs/ASVs, which have zero reads associated with them after subsetting will be kept, otherwise they will be removed from the data. Subset to specific OTUs/ASVs ############################ .. code-block:: python subset.sequences(obj, svlist) Returns an object subsetted to the selected OTUs/ASVs. *obj* is the object generated with files.load *svlist* is a list of OTU or ASV names that should be kept in the data set. Subset to the most abundant OTUs/ASVs ##################################### .. code-block:: python subset.abundant_sequences(obj, number=25, method='sum') Returns an object with only the most abundant OTUs/ASVs. *obj* is the object generated with files.load *number* specifies the number of ASVs to keep *method* is the method used to rank OTUs/ASVs. If *method* ='sum', the OTUs/ASVs are ranked based on the sum of the relative abundances in all samples. If *method* ='max', they are ranked based on the max relative abundance in a sample. Subset based on taxonomic classification ######################################### .. code-block:: python subset.text_patterns(obj, subsetLevels=[], subsetPatterns=[]): Searches for specific text patterns among the taxonomic classifications. Returns an object subsetted to OTUs/ASVs matching those text patterns. *subsetLevels* is a list taxonomic levels in which text patterns are searched for, e.g. ['Family', 'Genus'] *subsetPatterns* is a list of text patterns to search for, e.g. ['Nitrosom', 'Brochadia'] Merge samples ############## .. code-block:: python subset.merge_samples(obj, var='None', slist='None', keep0=False) Returns an object where samples belonging the same category (as defined in the meta data) have been merged. *var* is the column heading in metadata used to merge samples, the read counts for all samples with the same text in var column will be merged *slist* is a list of names in meta data column which specify samples to keep, if slist='None' (default), all samples are kept if *keep0* =False, all OTUs/ASVs with 0 counts after merging will be discarded from the data. Rarefy ###### .. code-block:: python subset.rarefy_table(tab, depth='min', seed='None', replacement=False) subset.rarefy_object(obj, depth='min', seed='None', replacement=False): Rarefies a count table to a specific number of reads per sample. The function subset.rarefy_table() operates only on the count table and returns only a rarefied table. The function subset.rarefy_object() operates on the whole object and returns a whole object. This means that samples and OTUs/ASVs which might have been dropped from the count table during rarefaction are also dropped from the 'ra', 'tax', 'seq', and 'meta' dataframes of the object. *tab* is the count table to be rarefied *object* is the object containing the count table to be rarefied if *depth* ='min', the minimum number of reads in a sample is used as rarefaction depth, otherwise a number can be specified *seed* sets a random state for reproducible results, use an integer. if *replacement* =False, the function is similar to rarefaction without replacement, if *replacement* =True, it does rarefaction with replacement. Consensus table ############### .. code-block:: python subset.consensus(objlist, keepObj='best', taxa='None', alreadyAligned=False, differentLengths=False, nameType='ASV', onlyReturnSeqs=False) Takes a list of objects and returns a consensus object based on ASVs found in all. Information about the fraction of reads retained from the original objects is also provided. *objlist* is a list of objects *keepObj* makes it possible to specify which object in objlist that should be kept after filtering based on common ASVs, specify with integer (0 is the first object, 1 is the second, etc), ‘best’ means that the object which has the highest fraction of its reads mapped to the common ASVs is kept *taxa* makes it possible to specify with an integer the object having taxa information that should be kept (0 is the first object, 1 is the second, etc), if 'None', the taxa information in the kept object is used if *alreadyAligned* =True, the subset.align_sequences function has already been run on the objects to make sure the same sequences in different objects have the same names if *differentLengths* =True, it assumes that the same ASV inferred with different bioinformatics pipelines could have different sequence lengths. *nameType* is the label used for sequences (e.g. ASV or OTU) if *onlyReturnSeqs* =True, only a dataframe with the shared ASVs is returned. Example .. code-block:: python import qdiv cons_obj, info = qdiv.subset.consensus([obj1, obj2]) qdiv.stats.print_info(cons_obj) print(info) In the example above, *cons_obj* is the new consensus object constructed based on obj1 and obj2. *info* contains information about the fraction of reads retained from obj1 and obj2, as well as the maximum relative abundance of reads lost in a sample in each of the original objects. .. code-block:: python import qdiv shared_seqs, info = qdiv.subset.consensus([obj1, obj2], onlyReturnSeqs=True) In the example above, *shared_seqs* is a pandas dataframe with the shared sequences *info* just contains a text string saying that the shared ASVs were returned. Merge objects ############# .. code-block:: python subset.merge_objects(objlist, alreadyAligned=False, differentLengths=False, nameType='ASV') Takes a list of objects and a merged objects including all OTUs/ASVs and samples. *objlist* is a list of objects if *alreadyAligned* =True, the subset.align_sequences function has already been run on the objects to make sure the same sequences in different objects have the same names if *differentLengths* =True, it assumes that the same ASV inferred with different bioinformatics pipelines could have different sequence lengths. *nameType* is the label used for sequences (e.g. ASV or OTU) Example .. code-block:: python import qdiv merged_obj = qdiv.subset.merge_objects([obj1, obj2]) qdiv.stats.print_info(merged_obj) In the example above, *merged_obj* is the new object constructed by combining obj1 and obj2.